In the following exercises we will wrangle some data on population from Gapminder.

1

Read in the Gapminder population data and store it as a new object called gap_pop.
The data we need is stored in a .csv file which you can find in the folder data/gapminder.
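As a rough sketch of the reading step, the example below writes a tiny made-up CSV to a temporary file and reads it back with read_csv() from readr. The path toy_path, the object toy_pop, and all values are hypothetical placeholders; in the exercise you would point read_csv() at the actual file in data/gapminder.

```r
library(readr)

# Hypothetical stand-in for the real file: a tiny wide-format CSV with
# made-up values, written to a temporary location so the example runs anywhere.
toy_path <- tempfile(fileext = ".csv")
write_lines(
  c("Total population,1990,1991",
    "Germany,100,110",
    "France,200,210"),
  toy_path
)

# read_csv() returns a tibble and keeps the column names as they appear
# in the file, including names with spaces or names that are numbers.
toy_pop <- read_csv(toy_path, show_col_types = FALSE)
toy_pop
```

Column names such as Total population or 1990 are not syntactic R names, so they have to be wrapped in backticks when referred to in later steps.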

As you may have noticed, the name of the first column in the dataset does not match its content.

2

Rename the variable Total population to country and store the result in an object with the same name (gap_pop).
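One possible way to do this uses rename() from dplyr, which takes pairs of the form new_name = old_name. The sketch below runs on a small made-up tibble (toy_pop is a hypothetical stand-in for gap_pop).

```r
library(dplyr)

# Made-up wide-format data with a misleading first column name
toy_pop <- tibble(
  `Total population` = c("Germany", "France"),
  `1990` = c(100, 200)
)

# rename(new_name = old_name); backticks are needed because the old
# name contains a space
toy_pop <- toy_pop %>%
  rename(country = `Total population`)

names(toy_pop)
```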

As you have probably noticed, the data are currently in wide format.

3

Using the data in wide format, select only data for the years 1990 to 1999.
As the values for each year are in separate columns, you need to use the select() function.
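A minimal sketch of selecting a range of year columns, shown on made-up data with fewer years than the real dataset (toy_pop and toy_selected are hypothetical names). Inside select(), the colon operator picks all columns between two named columns, so it relies on the year columns being in consecutive order.

```r
library(dplyr)

# Made-up wide-format data: one column per year
toy_pop <- tibble(
  country = c("Germany", "France"),
  `1989` = c(1, 2),
  `1990` = c(3, 4),
  `1991` = c(5, 6),
  `1992` = c(7, 8)
)

# Keep country plus every column from 1990 up to and including 1991
toy_selected <- toy_pop %>%
  select(country, `1990`:`1991`)

names(toy_selected)
```

For the exercise, the same pattern would run from `1990` to `1999`.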

For the next data wrangling steps it is more convenient to have the data in long format.

4

Transform the gap_pop dataset into a sensible long format. Name the variable representing the population values pop and store the resulting dataframe in an object with the same name as before (gap_pop).
This is just a repetition from the Tidy Data exercises. What we want to do is to gather the columns with the years into a year variable.
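The reshaping step might look like the sketch below, which uses pivot_longer() from tidyr on made-up data (toy_pop and toy_long are hypothetical names). pivot_longer() is the newer replacement for gather() in tidyr; the equivalent gather() call is given in a comment.

```r
library(dplyr)
library(tidyr)

# Made-up wide-format data
toy_pop <- tibble(
  country = c("Germany", "France"),
  `1990` = c(100, 200),
  `1991` = c(110, 210)
)

# Stack every column except country: the old column names go into
# "year", the cell values into "pop".
# Equivalent with the older interface:
#   gather(toy_pop, key = "year", value = "pop", -country)
toy_long <- toy_pop %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "pop")

toy_long
```

Note that the new year column is of type character at this point, because it was built from column names.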

For some analyses it might help (or even be necessary) to only work with a specific subset of observations.

5

Create two new dataframes that contain different subsets of the gap_pop data: 1. data for all countries for the 19th century (name this one gap_pop_19thcen), 2. data for Germany for the years from 2000 onwards (name this one gap_pop_ger_21stcen).
There are several ways to filter the observations according to the above instructions. However, some require more typing than others.
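To illustrate the difference in typing, the sketch below filters made-up long-format data in two ways (all object names and values are hypothetical). It assumes that year is numeric; if year is still a character column, it would need to be converted first, for example with as.integer(). The sketch also assumes the 19th century runs from 1801 to 1900, which is a choice the exercise leaves open.

```r
library(dplyr)

# Made-up long-format data
toy_long <- tibble(
  country = c("Germany", "Germany", "France", "Germany"),
  year    = c(1850L, 2005L, 1850L, 1999L),
  pop     = c(10, 40, 20, 30)
)

# Verbose version: two comparisons combined with &
toy_19th_verbose <- toy_long %>%
  filter(year >= 1801 & year <= 1900)

# Shorter version: between() includes both boundaries
toy_19th <- toy_long %>%
  filter(between(year, 1801, 1900))

# Conditions separated by commas are combined with a logical AND
toy_ger_21st <- toy_long %>%
  filter(country == "Germany", year >= 2000)
```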

For some analyses as well as for plotting the data, it makes sense to define the country variable as a factor.

6

Change the variable types of the dataset: country should be a factor, year and pop should be integers. Again, keep the object name for the resulting dataframe.
You need to use the mutate() function for this.
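A small sketch of converting column types inside mutate(), again on made-up data (toy_long is a hypothetical stand-in for gap_pop). Assigning to an existing column name overwrites that column.

```r
library(dplyr)

# Made-up long-format data: year is character, pop is double
toy_long <- tibble(
  country = c("Germany", "France"),
  year    = c("1990", "1990"),
  pop     = c(100, 200)
)

# Overwrite each column with a converted version of itself
toy_long <- toy_long %>%
  mutate(
    country = as.factor(country),
    year    = as.integer(year),
    pop     = as.integer(pop)
  )

str(toy_long)
```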

Let’s imagine that we want to combine the population data that we have with some other country-level data (we will discuss joining datasets in session B2 on relational data tomorrow). If the data come from different sources, it is quite likely that the names of the countries differ between them. If we want to join the datasets, we need to harmonize the country names.

7

The gap_pop dataset contains data for the countries Cook Is, Kyrgyz Republic, and Micronesia, Fed. Sts.. Rename them to Cook Island, Kyrgyzstan, and Micronesia.
As we want to use the country variable for joining the datasets in this hypothetical example, we should recode into the same variable.
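One way to recode values in place is dplyr's recode() inside mutate(), sketched below on a made-up tibble (toy_long is a hypothetical name). recode() takes pairs of the form old = new and leaves all other values unchanged; it also works on factors, where it replaces the matching levels. Other approaches, such as case_when() or the forcats package, would work as well.

```r
library(dplyr)

# Made-up data with country stored as a factor
toy_long <- tibble(
  country = factor(c("Cook Is", "Germany", "Kyrgyz Republic",
                     "Micronesia, Fed. Sts.")),
  pop     = c(1, 2, 3, 4)
)

# Old names on the left, new names on the right; old names containing
# spaces or punctuation have to be quoted
toy_long <- toy_long %>%
  mutate(country = recode(country,
    "Cook Is"               = "Cook Island",
    "Kyrgyz Republic"       = "Kyrgyzstan",
    "Micronesia, Fed. Sts." = "Micronesia"
  ))

levels(toy_long$country)
```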

Of course, instead of changing the variable types after reading in the data, as we did in exercise 6, we could have also specified the column types when reading in the data (see the session and exercises on importing data).

In the next step, we want to create some new variables based on ones that already exist in the dataset.

8

  1. Create the variable Population in thousands in the gap_pop dataset (name the new variable pop_in_thousands).

  2. Compute the percentage change in population since the previous year for the gap_pop_ger_21stcen dataset (name this new variable pop_perc_change).
To compute the percentage change variable you need the lag() function.
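Both computations can be sketched with mutate() on made-up data for a single country (toy_ger and all values are hypothetical). lag() shifts a vector by one position, so each row sees the value of the row before it; this only gives the previous year if the rows are sorted by year, which is why the sketch calls arrange() first. The percentage change is computed as (current - previous) / previous * 100.

```r
library(dplyr)

# Made-up data for one country, deliberately not sorted by year
toy_ger <- tibble(
  country = "Germany",
  year    = c(2001L, 2000L, 2002L),
  pop     = c(110000, 100000, 121000)
)

toy_ger <- toy_ger %>%
  arrange(year) %>%
  mutate(
    pop_in_thousands = pop / 1000,
    # The first row has no previous year, so its change is NA
    pop_perc_change  = (pop - lag(pop)) / lag(pop) * 100
  )

toy_ger
```

For data with several countries, a group_by(country) before the mutate() would keep lag() from reaching across country boundaries.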

Finally, let’s combine two basic data wrangling steps to answer an actual question with the data.

9

Which five countries were the most populous in 2015, and which five were the least populous?
To answer this question you need to filter and arrange the gap_pop dataset.
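The filter-then-arrange pattern is sketched below on made-up data, taking the top and bottom two rows instead of five (all names and values are hypothetical). arrange() sorts in ascending order by default; wrapping the column in desc() reverses the order.

```r
library(dplyr)

# Made-up long-format data
toy_long <- tibble(
  country = c("A", "B", "C", "D", "A"),
  year    = c(2015L, 2015L, 2015L, 2015L, 2014L),
  pop     = c(50, 10, 40, 20, 99)
)

# Largest populations in 2015: sort descending, keep the first rows
toy_most <- toy_long %>%
  filter(year == 2015) %>%
  arrange(desc(pop)) %>%
  head(2)

# Smallest populations in 2015: sort ascending, keep the first rows
toy_least <- toy_long %>%
  filter(year == 2015) %>%
  arrange(pop) %>%
  head(2)
```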